Triton MLA perf fixes #33529
Conversation
Code Review
This pull request introduces significant performance optimizations for Triton MLA, particularly for long context lengths. The changes include enabling CUDA graph support for decode, which can reduce kernel launch overhead. A more adaptive heuristic for calculating num_kv_splits is introduced, which should improve parallelism on different hardware. The core Triton attention kernel has been optimized by improving memory access patterns for better coalescing, reordering loads to hide latency, and adding cache modifier hints for the Triton compiler. These are solid, expert-level optimizations that should lead to the claimed performance improvements. The changes look correct and well-implemented.
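The adaptive `num_kv_splits` heuristic mentioned in the review can be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual code: the function name, block size, and split cap are assumptions. The idea is that at low batch counts, each sequence's KV range is split across more programs so that the GPU's SMs stay occupied, while at high batch counts fewer splits per sequence are needed.

```python
def adaptive_num_kv_splits(batch_size: int,
                           context_len: int,
                           num_sms: int,
                           block_kv: int = 256,
                           max_splits: int = 16) -> int:
    """Pick how many chunks to split one sequence's KV cache into.

    Hypothetical sketch: balance batch_size * splits against the SM
    count, but never split finer than one KV block per chunk, and cap
    the total to bound the size of the reduction step.
    """
    # Number of KV blocks this sequence occupies.
    kv_blocks = (context_len + block_kv - 1) // block_kv
    # Aim for roughly batch_size * splits ~= num_sms worth of programs.
    target = max(1, num_sms // max(batch_size, 1))
    return max(1, min(target, kv_blocks, max_splits))
```

For example, a batch of 1 with an 80k context on a GPU with 132 SMs would use the maximum number of splits, while a large batch at moderate context would use only one or two, since the batch dimension already fills the device.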
Hi @koush, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Benchmarks on Kimi k2.5
Signed-off-by: Koushik Dutta <koushd@gmail.com>
@tjtanaa I tested -dcp with cuda graph and the model worked fine, but I did notice a performance degradation compared to not using cuda graphs. I'm not sure what the expected behavior is supposed to be. Regardless, since dcp is the recommended way to run it, as it saves KV cache space by preventing data duplication, and cuda graphs reduced performance, I backed out that part of my change.
I'm running this like
Is this what you need?
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
Triton MLA performance on sm120 degrades at batch size 1 as context length increases.
This perf issue had been bugging me for a while, as it made DeepSeek and Kimi k2 unusable; when Kimi k2.5 was released, I finally got around to digging into it. I'm familiar with CUDA, but this is my first foray into Triton.
The primary issues are suboptimal KV splitting at low batch counts, resulting in underutilized SMs, and an unnecessary load of the Q vector, since c_kv contains both.
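To illustrate the KV splitting referred to above: in a split-KV decode kernel, each sequence's KV range is partitioned into contiguous chunks that separate Triton programs attend over in parallel, with a final reduction combining the partial results. A minimal sketch of the partitioning, with a hypothetical helper name not taken from the PR:

```python
def kv_split_ranges(context_len: int, num_kv_splits: int) -> list[tuple[int, int]]:
    """Partition [0, context_len) into near-equal contiguous chunks.

    Hypothetical sketch: each (start, end) range would be handled by a
    separate program/CTA; a non-final range may be empty when the
    context is shorter than the split count, so empty tails are dropped.
    """
    per = (context_len + num_kv_splits - 1) // num_kv_splits
    return [(i * per, min((i + 1) * per, context_len))
            for i in range(num_kv_splits)
            if i * per < context_len]
```

With too few splits at batch size 1, only a handful of SMs get a range to work on; the adaptive heuristic increases the split count as context grows so the whole device participates.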
Test Plan
Test Kimi k2.5, DeepSeek v2, and Qwen 235B.
Test Result
Models load and generate correctly; Kimi k2.5 shows a notable improvement, as seen below.
Testing with -dcp 8 and 80k context